Abstract

We introduce Mean Field Markov games with N players, in which each
individual in a large population interacts with other randomly selected
players. The states and actions of each player in an interaction together
determine the instantaneous payoff for all involved players. They also
determine the transition probabilities to move to the next state. Each 
individual wishes to maximize the total expected discounted payoff over 
an infinite horizon. We provide a rigorous derivation of the asymptotic 
behavior of this system as the size of the population grows to infinity. 
Under an indistinguishability-per-type assumption, we show that under any
Markov strategy, the random process consisting of one specific player and 
the remaining population converges weakly to a jump process driven by 
the solution of a system of differential equations. We characterize the 
solutions to the team and to the game problems at the limit of infinite 
population and use these to construct near optimal strategies for the case 
of a finite, but large, number of players. We show that the large population 
asymptotic of the microscopic model is equivalent to a (macroscopic) mean 
field stochastic game in which a local interaction is described by a single 
player against a population profile (the mean field limit). We illustrate
our model by deriving the equations for a dynamic evolutionary Hawk and
Dove game with energy levels.



1 Introduction 

We consider a large population of players in which frequent interactions occur
between small numbers of chosen individuals. Each interaction in which a player
is involved can be described as one stage of a dynamic game. The state and 
actions of the players at each stage determine an immediate payoff (also called 
fitness in behavioral ecology) for each player as well as the transition probabilities
of a controlled Markov chain associated with each player. Each player
wishes to maximize its expected fitness averaged over time. 

This model extends the basic evolutionary games by introducing a controlled 
state that characterizes each player. The stochastic dynamic games at each 
interaction replace the matrix games, and the objective of maximizing the ex- 
pected long-term payoff over an infinite time horizon replaces the objective of 
maximizing the outcome of a matrix game. Instead of a choice of a (possibly 
mixed) action, a player is now faced with the choice of decision rules (called 
strategies) that determine what actions should be chosen at a given interaction 
for given present and past observations. 

This model with a finite number of players, called a mean field interaction
model, is in general difficult to analyze because of the huge state space required
to describe the state of all players. Taking the asymptotics as the number
of players grows to infinity, the whole behavior of the population is replaced by
a deterministic limit that represents the system's state, namely the fraction of the
population in each individual state that uses a given action.

In this paper we study the asymptotic dynamic behavior of the system in 
which the population profile evolves in time. For large N, under mild assumptions
(see Section 3), the mean field converges to a deterministic measure that
satisfies a non-linear ordinary differential equation under any stationary
strategy. We show that the mean field interaction is asymptotically equivalent
to a Markov decision evolutionary game. When the rest of the population uses 
a fixed strategy u, any given player sees an equivalent game against a collective 
of players whose state evolves according to the ordinary differential equation 
(ODE) which we explicitly compute. In addition to providing the exact limiting 
asymptotic, the ODE approach provides tight approximations for fixed large N. 
The mean field asymptotic calculations for large N for given choices of strategies
allow us to compute the equilibrium of the game in the asymptotic regime.

1.1 Related Work 

Mean field interaction models have already been used in standard evolutionary 
games in a completely different context: that of evolutionary game dynamics 
(such as replicator dynamics) see e.g. and references therein. The paradigm 
there has been to associate relative growth rate to actions according to the fitness 
they achieved, then study the asymptotic trajectories of the state of the system, 
i.e. the fraction of users that adopt the different actions. Non-atomic Markov 
Decision Games have been studied in [1] and applied in [2] to firm idiosyncratic
random shocks using decentralized strategies. They proposed the notion of
oblivious equilibria via a mean field approximation. An extension to unbounded
cost functions can be found in [3]. Applications to cellular communications can
be found in [4].

Most of these approaches considered the case where the payoff of a player 
depends on the states of the other players but not explicitly on the actions of the 
others. In this paper, the payoff depends explicitly on both states and actions
of the other players.



1.2 Structure 

The remainder of this paper is organized as follows. In the next section we present
the model assumptions and notation. In Section 3 we present some convergence
results to the ODE with a random number of interacting players. In Section 4 a
resource competition between animals with two types of behaviors and several
states is presented. Sketches of all proofs are given in the Appendix. Section 5
concludes the paper.

2 Model description 

2.1 Mean Field Markov Process With N Players

We consider the following model, which we call Mean Field Markov Game with
N players.

• There are N players.

• Each player has its own state. A state has two components: the type of
the player and the internal state. The type is constant during the game. The
state of player j at time t is denoted by $X_j^N(t) = (\theta_j, S_j^N(t))$ where $\theta_j$ is the
type. The set of possible states $\mathcal{X} = \{1, \ldots, \Theta\} \times S$ is finite.

• Time is discrete, taking values in $\mathbb{N}/N := \{0, \frac{1}{N}, \frac{2}{N}, \ldots\}$.

• The global detailed description of the system at time t is $X^N(t) = (X_1^N(t), \ldots, X_N^N(t))$.
Define $M^N(t)$ to be the current population profile, i.e. $M_x^N(t) = \frac{1}{N} \sum_{j=1}^{N} 1_{\{X_j^N(t) = x\}}$.

At each time t, $M^N(t)$ is in the finite set $\{0, \frac{1}{N}, \ldots, 1\}^{|\mathcal{X}|}$, and $M_{\theta,s}^N(t)$ is the
fraction of players who belong to the population of type $\theta$ (also called subpopulation
$\theta$) and have internal state s. Also let $N_\theta^N = N \sum_{s \in S} M_{\theta,s}^N(t)$ be the size of
subpopulation $\theta$ (independent of t by hypothesis). We do not make any specific
hypothesis on the ratios $N_\theta^N / N$ as N gets large (they may be constant or not, they may
tend to 0 or not).

• Strategies and local interaction: At time slot t, an ordered list $B^N(t)$
of players in {1, 2, ..., N}, without repetition, is selected randomly as follows.
First we draw a random number of players K(t) such that

$$P(K(t) = k \mid M^N(t) = m) = J_k^N(m)$$

where the distribution $J^N(m)$ is given for any N and $m \in \{0, \frac{1}{N}, \ldots, 1\}^{|\mathcal{X}|}$.

Second, we set $B^N(t)$ to an ordered list of K(t) players drawn uniformly at random
among the $N(N-1)\cdots(N-K(t)+1)$ possible ones. By abuse of notation we
write $j \in B^N(t)$ with the meaning that j appears in the list $B^N(t)$.

Each player j such that $j \in B^N(t)$ takes part in a one-shot event at time t, as
follows. First, the player chooses an action a in the finite set A with probability
$u_\theta(a|s)$ where $(\theta, s)$ is the current player state. The stochastic array u is the
strategy profile of the population, and $u_\theta$ is the strategy of subpopulation $\theta$.



A vector of probability distributions u which depends only on the type of the
player and its internal state is called a stationary strategy.

Second, say that $B^N(t) = (j_1, \ldots, j_k)$. Given the actions $a_{j_1}, \ldots, a_{j_k}$ drawn by
the k players, we draw a new set of internal states $(s'_{j_1}, \ldots, s'_{j_k})$ with probability

$$L^N_{\theta; s; a; s'}(k, m)$$

where $\theta = (\theta_{j_1}, \ldots, \theta_{j_k})$, $s = (s_{j_1}, \ldots, s_{j_k})$,
$a = (a_{j_1}, \ldots, a_{j_k})$, $s' = (s'_{j_1}, \ldots, s'_{j_k})$ and $m = M^N(t)$.

Then the collection of k players makes one synchronized transition, such that

$$S^N_{j_i}(t + \tfrac{1}{N}) = s'_{j_i}, \quad i = 1, \ldots, k$$

Note that $S_j^N(t + \frac{1}{N}) = S_j^N(t)$ if j is not in $B^N(t)$.

It can easily be shown that this form of interaction has the following properties:
(1) $X^N$ is Markov and (2) players can be observed only through their state.

The model is entirely specified by the probability distributions $J^N$, the
Markov transition kernels $L^N$ and the strategy profile u. In this paper, we
assume that $J^N$ and $L^N$ are fixed for all N, but u can be changed and does
not depend on N (it would be trivial to extend our results to strategies
that depend on N, but this appears to be an unnecessary complication). We are
interested in large N.

It follows from our assumptions that 

1. $M^N(t)$ is Markov.

2. For any fixed $j \in \{1, \ldots, N\}$, $(X_j^N(t), M^N(t))$ is Markov. This means that
the evolution of one specific player $X_j^N(t)$ depends on the other players
only through the occupancy measure $M^N(t)$.
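The interaction mechanism described above can be sketched in a few lines of Python. This is only an illustrative simulation under simplifying assumptions, not the authors' code: there is a single type, and the callables `sample_group_size`, `choose_action` and `transition` are hypothetical stand-ins for the distribution $J^N$, the stationary strategy u and the kernel $L^N$.

```python
import random


def occupancy(states):
    """Return the population profile: fraction of players in each state."""
    n = len(states)
    profile = {}
    for s in states:
        profile[s] = profile.get(s, 0.0) + 1.0 / n
    return profile


def simulate_population(n_players, n_steps, states, sample_group_size,
                        choose_action, transition, rng):
    """Run n_steps time slots (each of length 1/N) of the interaction model.

    states            -- list of internal states, one per player (updated in place)
    sample_group_size -- hypothetical callable (profile, rng) -> K(t), plays the role of J^N
    choose_action     -- hypothetical callable (state, rng) -> action, plays the role of u(.|s)
    transition        -- hypothetical callable (states, actions, rng) -> new states, plays L^N
    """
    for _ in range(n_steps):
        profile = occupancy(states)
        k = sample_group_size(profile, rng)
        # Ordered list of k distinct players, drawn uniformly at random.
        group = rng.sample(range(n_players), k)
        group_states = [states[j] for j in group]
        actions = [choose_action(s, rng) for s in group_states]
        new_states = transition(group_states, actions, rng)
        # Only the selected players move; all others keep their state.
        for j, s_new in zip(group, new_states):
            states[j] = s_new
    return occupancy(states)
```

Because the update only needs the states of the selected players and the current profile, the pair (one tagged player, occupancy measure) carries all the information needed to continue the simulation, which is the Markov property stated above.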

2.2 Payoffs 

We consider two types of instantaneous payoff and one discounted payoff: 

• Instant Gain: This is the random gain $G_j^N(t)$ obtained by one player
whenever it is involved in an event at time t. We assume that it depends on this
player's state just before the event and just after the event, the chosen action,
and on the states and actions of all players involved in this event. Formally, if
player $j \in B^N(t)$,

$$G_j^N(t) = g^N\big(x_j, a_j, x'_j, x_{B^N(t)\setminus j}, a_{B^N(t)\setminus j}, x'_{B^N(t)\setminus j}\big)$$

where $x_j = X_j^N(t)$, $a_j$ is the action chosen by player j, $x'_j = X_j^N(t + \frac{1}{N})$, $x_{B^N(t)\setminus j}$
[resp. $x'_{B^N(t)\setminus j}$] is the list of states at time t [resp. at time $t + \frac{1}{N}$] of players
other than j involved in the event, $a_{B^N(t)\setminus j}$ is the list of their actions and $g^N()$ is
some non random function defined on the set of appropriate lists. Whenever j
is not in $B^N(t)$, $G_j^N(t) = 0$. We assume that $G_j^N(t)$ is bounded, i.e. there is a
non random number $G_0$ such that, with probability 1, for all j, t: $|G_j^N(t)| \le G_0$.



• Expected Instant Payoff: It is defined as the expected instant gain of player
j, given the state x of j and the population profile m. By our indistinguishability
assumption, it does not depend on the identity of a player, so we can write it as

$$r^N(u, x, m) := E\big(G_j^N(t) \mid X_j^N(t) = x, M^N(t) = m\big)$$

Note that this conditional expectation contains the case when j is not in $B^N(t)$,
i.e. when $G_j^N(t) = 0$.

• Discounted Long-Term Payoff: It is defined as the expected discounted
long term payoff of one player, given the initial state of this player and the
population:

$$\bar{r}^N(u; x, m) := E\Big(\sum_{t=0\ \mathrm{step}\ 1/N}^{\infty} e^{-\beta t} G_j^N(t) \,\Big|\, X_j^N(0) = x, M^N(0) = m\Big)$$

where $\beta$ is a positive parameter (existence follows from the boundedness of
$G_j^N$). The fact that it does not depend on the identity j of the player, but
only on its initial state x and the initial population profile m, follows from the
indistinguishability assumption.

We defined the Discounted Long-Term Payoff in terms of the instant gain, 
as this is the most natural definition. The following proposition shows that the 
alternative definition, by means of the expected instant payoff, is equivalent. 

Proposition 2.2.1. For all player states x and population profiles m,

$$\bar{r}^N(u; x, m) = E\Big(\sum_{t=0\ \mathrm{step}\ 1/N}^{\infty} e^{-\beta t}\, r^N(u, X_j^N(t), M^N(t)) \,\Big|\, X_j^N(0) = x, M^N(0) = m\Big)$$
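As a small illustration of the discounted sum with time step 1/N, the following Python sketch evaluates a finite-horizon truncation of the payoff for a given sequence of gains, together with a bound on the neglected tail that follows from the boundedness of the gains. The function names and the truncation itself are assumptions made for the example; they do not appear in the paper.

```python
import math


def discounted_payoff(gains, beta, n_players):
    """Sum of exp(-beta * t) * G(t) over t = 0, 1/N, 2/N, ... for the given gains.

    gains[k] is the gain collected at time t = k / n_players.
    """
    return sum(math.exp(-beta * k / n_players) * g for k, g in enumerate(gains))


def tail_bound(gain_bound, beta, n_players, n_terms_kept):
    """Upper bound on the absolute value of the terms dropped after n_terms_kept.

    Uses |G(t)| <= gain_bound and the geometric series with ratio exp(-beta / N).
    """
    ratio = math.exp(-beta / n_players)
    return gain_bound * ratio ** n_terms_kept / (1.0 - ratio)
```

For instance, with N = 1 and $\beta = \ln 2$ each slot halves the weight, so two unit gains are worth 1 + 1/2 and the whole series of unit gains is bounded by 2.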

2.3 Focus on One Single Player 

We are interested in the following special case (here we make the dependency
on the strategy explicit). There are two types of players, i.e. $\Theta = 2$. There is
exactly one player (the player of interest) with type 1. All other players have
type 2. In this case we use the notation $R^N(u_1, u_2; s, m)$ for the discounted
long-term payoff obtained by the player of type 1, when her strategy is $u_1$ and
all other players' strategy is $u_2$, given that this player's initial internal state is
s and the initial type 2 subpopulation profile is m. Note that

$$R^N(u_1, u_2; s, m) = \bar{r}^N(u_1, u_2; (1, s), m')$$

with $m'_{1,s'} = \frac{1}{N} 1_{\{s = s'\}}$ and $m'_{2,s'} = m_{2,s'}$ for all $s' \in S$.

Mean Field Markov Game 

Player j may choose a strategy $u_j$ whose law depends on its type and its own
internal state. We look for a (Nash) equilibrium u such that if all players use
u then no player has an incentive to deviate from u. For any finite N one can
map this into a standard Markov game. This is true both when the
number of players is known and when it is unknown at the time of taking a decision.
Therefore we know that a stationary equilibrium exists in the discounted case.
A stationary equilibrium is a solution of the fixed point equation:

$$\forall j, \quad u_j \in \arg\max_{v_j} R^N(v_j, u_{-j}; s, m)$$

By assuming indistinguishability per type we can show that a stationary equilibrium
exists which is a solution of the fixed point equation

$$\forall \theta, \quad u_\theta \in \arg\max_{v_\theta} R^N(v_\theta, u; s, m)$$

Note that the mean field optimality here refers to the maximization of
$R^N(\cdot, u; s, m)$ over symmetric and stationary strategies. It is not necessarily
optimal in the global sense.

Mean Field Markov Team 

We wish to find a stationary u that maximizes $\bar{r}^N$ averaged over all players:

$$u = (u_1, \ldots, u_\Theta) \in \arg\max_{v} R^N(v; s, m)$$

3 Main Results 

3.1 Scaling Assumptions 

We are interested in the large N regime and obtain that, for any fixed j,
$(X_j^N, M^N)$ converges weakly to a simple process. This requires the weak convergence
of $M^N(0)$ to some $m_0$.

We assume that the parameters of the model and the payoff per time unit
converge as $N \to \infty$, i.e.

$$J_k^N(m) \to J_k(m)$$
$$L^N_{\theta; s; a; s'}(k, m) \to L_{\theta; s; a; s'}(k, m) \quad (1)$$
$$r^N(u, x, m) \to r(u, x, m)$$

Our main scaling assumption is 

H1 $\sum_k k^2 J_k(m) < \infty$ for all $m \in \Delta$. This ensures that the second moment of
the number of players involved in an event per time slot is bounded.

Note that H1 excludes the case where the number of players involved in an
event per time slot scales like N (i.e. synchronous transitions of all players at
the same time). There may be large N asymptotic results for such cases [13]
but the limit is not given by an ODE. In contrast, H1 is automatically true if
the number of players involved in an event per time slot is upper bounded by
a non random constant. We also need some technical assumptions, which are
usually true and can be verified by inspection.



H2 $\sum_{k \ge 1} J_k(m) > 0$ for all $m \in \Delta$ ($\Delta$ is the simplex $\{m : m_{\theta,s} \ge 0, \sum_{\theta,s} m_{\theta,s} = 1\}$).
This ensures that the mean number of players involved in an event
per time slot, $\sum_{k \ge 0} k J_k(m)$, is non zero.

Define the drift of $M^N(t)$ as

$$f^N(u, m) := E\big(M^N(t + \tfrac{1}{N}) - M^N(t) \mid M^N(t) = m\big)$$

Note that we make explicit the dependency on the strategy u but not on J and
L, assumed to be fixed.

It follows from our hypotheses that

$$\lim_{N \to \infty} N f^N(u, m) := f(u, m) \quad (2)$$

exists.



H3 We assume that the convergence in Equation (2) is uniform in m and the
limit is Lipschitz-continuous in m. This is in particular true if one can
write, for every strategy u, $f^N(u, m) = \frac{1}{N} \varphi_u(\frac{1}{N}, m)$, with $\varphi_u$ defined on
$[0, \epsilon] \times \Delta$ where $\epsilon > 0$ and $\varphi_u$ is continuously differentiable.

H4 $P(X_j^N(t + 1/N) = y \mid X_j^N(t) = x, M^N(t) = m, M^N(t + 1/N) = m')$ converges
uniformly in m, m' and the limit is Lipschitz-continuous in m, m'. This is in
particular true if one can write it, for every strategy u, as $\xi_{u,x;y}(1/N, m, m')$,
with $\xi$ defined on $[0, 1] \times \Delta \times \Delta$ and $\xi_{u,x;y}$ continuously differentiable.

Our model satisfies the assumptions in [5], therefore we have the following 
result: 

Theorem 3.1.1 ([5]). Assume that $\lim_{N \to \infty} M^N(0) = m_0$ in probability.
For any stationary strategy u, and any time t, the random process $M^N(t) =
\frac{1}{N} \sum_{j=1}^{N} \delta_{X_j^N(t)}$ converges in distribution to the (non-random) solution of the
ODE

$$\dot{m}(t) = f(u, m(t)) \quad (3)$$

with initial condition $m_0$.



3.2 Convergence results 

We focus on one player; without loss of generality we can call her player 1,
and consider the process $(X_1^N, M^N)$. For any finite N, $X_1^N$ and $M^N$ are not
independent; however, in the limit we have the following:

Theorem 3.2.1. Assume that $\lim_{N \to \infty} M^N(0) = m_0$ and $\lim_{N \to \infty} X_1^N(0) =
x_0 = (\theta_1, s_0)$ in probability. The discrete time process $(X_1^N(t), M^N(t))$ defined
for $t \in \mathbb{N}/N$ converges weakly to the continuous time jump and drift process
$(X_1(t), m(t))$, where $m(t)$ is the solution of the ODE (3) with initial condition
$m_0$ and $X_1(t)$ is a continuous time, non homogeneous jump process, with
initial state $x_0$. The rate of transition of $X_1(t)$ from state $x_1 = (\theta_1, s_1)$ to state
$x'_1 = (\theta_1, s'_1)$ is



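$$A(x_1, x'_1; m(t), u) = \sum_{k \ge 1} J_k(m(t))\, A_k(s_1, s'_1; m(t), u)$$

with

$$A_k(s_1, s'_1; m(t), u) = \sum_{\theta, s, a, s'} L_{\theta; s; a; s'}(k, m(t)) \prod_{j=1}^{k} u_{\theta_j}(a_j | s_j) \prod_{j=2}^{k} m_{\theta_j, s_j}(t)$$

where $\theta = (\theta_2, \ldots, \theta_k)$, $s = (s_2, \ldots, s_k)$,
$a = (a_1, \ldots, a_k)$, $s' = (s'_2, \ldots, s'_k)$.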

Note that, contrary to results based on propagation of chaos, we do not
assume that the distribution of player states at time 0 is exchangeable. In
contrast, we will use Theorem 3.2.1 precisely in the case where player 1 is
different from other players. Theorem 3.2.1 motivates the following definition.

Definition 3.3. To a game as defined in Section 2.1 we associate a "Macroscopic
Mean Field Markov Game", defined as follows. There is one player
(player 1), with state $X_1(t)$, and a population profile $m(t)$. The initial condition
of the game is $X_1(0) = x$, $m(0) = m_0$. The population profile is the solution to the
ODE (3) and $X_1(t)$ evolves as the jump process of Theorem 3.2.1.

Further, let $\bar{r}(u; x, m)$ be the discounted long-term payoff of player 1 in this
game, given that $X_1(0) = x$ and $m(0) = m$, i.e.

$$\bar{r}(u; x, m) = E\Big(\int_0^\infty e^{-\beta t}\, r(u, X_1(t), m(t))\, dt \,\Big|\, X_1(0) = x, m(0) = m\Big)$$

We also consider, as in Section 2.3, the case with $\Theta = 2$ types and define by
analogy $R(u_1, u_2; s, m)$ as the discounted long-term payoff when player 1 starts
in state s and the population profile starts in state m, with player 1 using strategy
$u_1$ and other players using strategy $u_2$.

In order to exploit the convergence in distribution of the process focused 
on one player, we need that the payoff be continuous in the topology of this 
convergence. This is stated in the next theorem. 

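Theorem 3.3.1. Let $E = S \times \Delta$ and $D_E[0, \infty)$ be the set of cadlag functions from
$[0, \infty)$ to E, equipped with Skorohod's topology. The mapping

$$D_E[0, \infty) \to \mathbb{R}, \quad (x(\cdot), m(\cdot)) \mapsto \int_0^\infty e^{-\beta t}\, r(u, x(t), m(t))\, dt$$

is continuous.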



Using Theorem 3.2.1 and Theorem 3.3.1 we obtain the following, which is
the main result of this paper. It says that when N goes to infinity, the Mean
Field Markov Game with N players becomes equivalent to the associated
Macroscopic Mean Field Markov Game. This reduces any multi-player problem
to an effective one-player problem facing an evolving aggregate object.

Theorem 3.3.2 (Asymptotically equivalent game). When N goes to infinity
we have (a) the discrete time process $X_1^N$ converges in distribution to the continuous
time process $X_1$, (b) $\bar{r}^N(u; x, m) \to \bar{r}(u; x, m)$ and (c) $R^N(u_1, u_2; s, m) \to
R(u_1, u_2; s, m)$.

3.4 Case with Global Attractor 

Assume that, for some strategy u, the ODE (3) has a global attractor $m^*$ (this
may or may not hold, depending on the ODE). If in addition the model with
N players is irreducible, with stationary probability distribution $\varpi^N$ for $M^N$,
then $\lim_{N \to \infty} \varpi^N = \delta_{m^*}$ where $\delta_{m^*}$ is the Dirac mass at $m^*$ (this follows from
[5]), i.e. the large time distribution of $M^N(t)$ converges, as $N \to \infty$, to the
attractor $m^*$.

Also, $(X_1^N(t), M^N(t))$ converges to a continuous time, homogeneous Markov
jump process with time-independent transition matrix:

$$A(x_1, x'_1; u) = \sum_{k \ge 1} J_k(m^*)\, A_k(s_1, s'_1; m^*, u)$$

Assume that the transition matrix $A(x_1, x'_1; u)$ is also irreducible and let $\pi()$
be its unique stationary probability. Also let $\pi^N$ be the first marginal of the
stationary probability of $(X_1^N, M^N)$. It is natural in this case to replace the
definition of the long term payoffs $R^N(u_1, u_2; s, m)$ and $R(u_1, u_2; s, m)$ by their
stationary counterparts

$$R^N_{st}(u_1, u_2) := \sum_s \pi^N(s)\, R^N(u_1, u_2; s, m^*)$$
$$R_{st}(u_1, u_2) := \sum_s \pi(s)\, R(u_1, u_2; s, m^*)$$

3.5 Single player per type selected per time slot 

Consider the special case where at each time slot, only one player per type
among the N is randomly selected and has a chance to change its action, i.e.
$|B^N| = 1$ w.p. 1.

Thus H1 and H2 are automatically satisfied. The resulting ODE (see [6])
becomes

$$\frac{d}{dt} m_x^\theta(t) = \sum_{x'} m_{x'}^\theta L_{x',x}(m, u, \theta) - m_x^\theta \sum_{x'} L_{x,x'}(m, u, \theta)$$

The term $\sum_{x'} m_{x'} L_{x',x}(m, u, \theta)$ is the incoming flow into x and the outgoing
flow from x is $m_x \sum_{x'} L_{x,x'}(m, u, \theta)$.

We then obtain a large class of state-dependent evolutionary game dynamics.
Note that in general the trajectories of the mean dynamics need not converge.
In the case of a single player selected in each time slot of length 1/N and transitions linear
in m, the time averages under the replicator dynamics converge to its interior rest
points or to the boundaries of the simplex.
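The inflow/outflow structure of this ODE is easy to explore numerically. The sketch below, for a single type, computes the drift from a rate function and integrates it with a plain explicit Euler scheme. The callable `rates` (returning a matrix with entry [x][y] equal to $L_{x,y}(m, u)$ for the strategy under study) and the choice of Euler integration are assumptions made for illustration.

```python
def drift(m, rates):
    """Right-hand side of the ODE: incoming flow minus outgoing flow per state.

    rates -- hypothetical callable m -> matrix L with L[x][y] the rate from x to y.
    Diagonal entries are skipped since they cancel between the two flows.
    """
    L = rates(m)
    n = len(m)
    out = []
    for x in range(n):
        inflow = sum(m[y] * L[y][x] for y in range(n) if y != x)
        outflow = m[x] * sum(L[x][y] for y in range(n) if y != x)
        out.append(inflow - outflow)
    return out


def euler(m0, rates, horizon, dt):
    """Integrate dm/dt = drift(m) from m0 over [0, horizon] with step dt."""
    m = list(m0)
    for _ in range(int(round(horizon / dt))):
        f = drift(m, rates)
        m = [mi + dt * fi for mi, fi in zip(m, f)]
    return m
```

With constant rates the ODE is linear and the trajectory converges to the stationary distribution of the rate matrix; state-dependent rates give the non-linear dynamics discussed in the text, whose trajectories need not converge.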



3.6 Equilibrium and optimality 

Let $U_s$ be the set of strategies. Consider the optimal control problems

(OPT_N) Maximize $R^N(u, u; s, m_0)$ s.t. $u \in U_s$

(OPT_∞) Maximize $R(u, u; s, m_0)$ s.t. $u \in U_s$

The strategy u is an $\epsilon$-optimal strategy for the N-optimal control problem if
$R^N(u, u; s, m_0) \ge -\epsilon + \sup_v R^N(v, v; s, m_0)$.

Also consider the fixed-point problems

(FIX_N) find $u \in U_s$ such that $u \in \arg\max_{v \in U_s} \{R^N(v, u; s, m_0)\}$

(FIX_∞) find $u \in U_s$ such that $u \in \arg\max_{v \in U_s} \{R(v, u; s, m_0)\}$

A solution to (FIX_N) or (FIX_∞) is a (Nash) equilibrium. We say that u is an
$\epsilon$-equilibrium for the game with N [resp. $N \to \infty$] players if $R^N(u, u; s, m_0) \ge
\sup_v R^N(v, u; s, m_0) - \epsilon$ [resp. $R(u, u; s, m_0) \ge \sup_v R(v, u; s, m_0) - \epsilon$].

Note that the definition of equilibrium and optimal strategy may depend on
the initial conditions. If, for any $u \in U_s$, the hypotheses in Section [33] hold,
then we may relax this dependency.

Theorem 3.6.1 (Finite N). For every discount factor $\beta > 0$ the optimal control
problem (OPT_N) (resp. the fixed-point problem (FIX_N)) has at least one
0-optimal strategy (resp. 0-equilibrium). In particular, there is an $\epsilon_N$-optimal
strategy (resp. $\epsilon_N$-equilibrium) with $\epsilon_N \to 0$.

Theorem 3.6.2 (Infinite N). Optimal strategies (resp. equilibrium strategies)
exist in the limiting regime when $N \to \infty$ under uniform convergence
and continuity of $R^N \to R$. Moreover, if $\{U^N\}$ is a sequence of $\epsilon_N$-optimal
strategies (resp. $\epsilon_N$-equilibrium strategies) in the finite regime with $\epsilon_N \to \epsilon$,
then any limit of a subsequence $U^{\phi(N)} \to U$ is an $\epsilon$-optimal strategy (resp.
$\epsilon$-equilibrium) for the game with infinite N.



3.7 Mean field equilibrium 



Each generic player 1 with strategy $v_1$ optimizes its own long-term payoff under
the behavior of its own internal state, which is a continuous time Markov jump
process driven by $A(x_1, x'_1; v_1, m(t), u)$, while the behavior of $m(t)$ is given by the
controlled ODE under the strategy u.

It is important to notice that at the infinite population limit the mean field
limit dynamics do not depend on $v_1$. This can be easily seen from the fact
that the effect of a single player is of order $1/N$. When $N \to +\infty$, the effect
becomes negligible with respect to the mass. However, $v_1$ can have a big effect
on the transition rate of that player via $A(x_1, x'_1; v_1, m(t), u)$.

The consistency between the individual state transition and the fraction of
players per state needs to be checked.

We say that the pair $(u^*_t, m^*(t))$ is a mean field equilibrium if $\{u^*_t\}_{t \ge 0}$ is a
mean field response to the individual dynamic optimization, where $m^*(t)$ is the
mean field at time t, and produces the mean field, i.e. $m[u^*, m_0](t) = m^*(t)$.

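If $(v^*, u^*, m^*)$ satisfies the following equations

$$\beta v_{\theta,t}(s, m) = \sup_{u'_\theta} \Big\{ r_\theta(u'_\theta, u_\theta, m(t)) + \sum_{s'} A(s, s'; m(t), u)\, v_{\theta,t}(s'_\theta, m(t)) \Big\} + f(u^*, m) \cdot \partial_m v_{\theta,t}$$

$$m_\theta(t) = m_{\theta,0} + \int_0^t f_\theta(u^*_{t'}, m(t'))\, dt'$$

$$m(0) = m_0 \in \Delta(\mathcal{X}), \quad \theta \in \Theta,$$

and the strategy

$$u^*_{\theta,t} \in \arg\max_{u'_\theta} \Big\{ r_\theta(u'_\theta, u_\theta, m(t)) + \sum_{s'} A(s, s'; m(t), u)\, v_{\theta,t}(s'_\theta, m(t)) \Big\},$$

then one gets a mean field equilibrium. The problem becomes $\max_{v_1 \in U_s} \{R(v_1, u; x_1, m_0)\}$
subject to the transitions $A(x_1, x'_1; v_1, m(t), u)$ and the ODE.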



4 Illustrating example 

We present in this section an example of a dynamic version of the Hawk and 
Dove problem where each individual has three energy levels. We derive the mean 
field limit for the case where all users follow a given policy and where possibly 
one player deviates. We then further simplify the model to only two energy 
states per player. In that case we are able to fully identify and compute the 
equilibrium in the limiting Mean Field Markov Game. Interestingly, we show 
that the ODE converges to a fixed point which depends on the initial condition 
and the policy. 

Consider a homogeneous population of N animals. An animal plays the role
of a player. Occasionally two animals find themselves in competition for the
same piece of food. Each animal has three states x = 0, 1, 2 which represent its
energy level. An animal can adopt an aggressive behavior (Hawk) or a peaceful
one (Dove, passive attitude). At the state x = 0 there is no action. We describe
the fitness of an animal (some arbitrary player) associated with the possible 
outcomes of the meeting as a function of the decisions taken by each one of the 
two animals. The fitnesses represent the following: 

• An encounter Hawk-Dove or Dove-Hawk results in zero fitness for the Dove
and in a fitness of v for the Hawk, which gets all the food without a fight. The
state of the Hawk (the winner) is incremented, $\alpha = 1_{\{x'_H = \min(x_H + 1, 2)\}}$, and
the state of the Dove is decremented, $\delta = 1_{\{x'_D = \max(x_D - 1, 0)\}}$.

• An encounter Dove-Dove results in a peaceful, equal sharing of the food,
which translates to a fitness of $\frac{v}{2}$ for each animal, and the state of each
animal changes with the sum of the two distributions.

• An encounter Hawk-Hawk results in a fight in which, with probability p = 1/2,
one (resp. the other) animal obtains the food, but in which each
animal also has a positive probability 1/2 of being wounded. Then
the fitness of animal 1 is $\frac{1}{2}(v - c) + \frac{1}{2}(-c) = \frac{1}{2}v - c$, where the $-c$
term represents the expected loss of fitness due to being injured.
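The three kinds of encounter listed above amount to the classical Hawk and Dove fitness table. A minimal Python sketch of the expected fitness of one animal, given its own action and that of its opponent, reads as follows; the function name and the string encoding of actions are assumptions chosen for the example.

```python
def expected_fitness(own, other, v, c):
    """Expected fitness of an animal playing `own` against an opponent playing `other`.

    own, other -- "H" (Hawk) or "D" (Dove)
    v          -- value of the contested food
    c          -- expected fitness loss from injury in a Hawk-Hawk fight
    """
    if own == "H" and other == "D":
        return float(v)        # the Hawk takes all the food without a fight
    if own == "D" and other == "H":
        return 0.0             # the Dove gets nothing
    if own == "D" and other == "D":
        return v / 2.0         # peaceful, equal sharing
    return v / 2.0 - c         # Hawk-Hawk: (1/2)(v - c) + (1/2)(-c)
```

When c exceeds v/2 the Hawk-Hawk entry is negative, which is the regime in which aggressive behavior is costly against an aggressive population.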
The vector of frequencies of states at time t is given by $M_x^N(t) = \frac{1}{N} \sum_{j=1}^{N} 1_{\{X_j^N(t) = x\}}$
for x = 0, 1, 2 and the action set is $A_x = \{H, D\}$ in each state $x \neq 0$, $A_0 = \{\}$.

The assumptions in Section 3 are satisfied (pairwise interaction, $|B^N(t)| = 2$)
and the occupancy measure $M^N(t)$ converges to $m(t)$.

4.1 ODE and Stationary strategies 

Consider the following fixed parameters $\mu_1 = L_{0,1}$, $\mu_2 = L_{0,2}$. The population
profile is denoted by $m = (m_0, m_1, m_2)$ and the stationary strategy is described
by the parameters $u_1, u_2$ where $u_1 := u(H|1)$, $u_2 := u(H|2)$.

$$\dot{m}_2 = m_0 L_{0,2} + m_1 L_{1,2}(u, m) - m_2 L_{2,1}(u, m)$$

$$\dot{m}_1 = m_0 L_{0,1} + m_2 L_{2,1}(u, m) - m_1 L_{1,2}(u, m) - m_1 L_{1,0}(u, m)$$

$$\dot{m}_0 = m_1 L_{1,0}(u, m) - (\mu_1 + \mu_2) m_0$$



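where

$$L_{1,2}(u, m) = u_1 \Big( m_0 + \frac{u_1 m_1}{2} + (1 - u_1) m_1 + \frac{u_2 m_2}{2} + (1 - u_2) m_2 \Big) + (1 - u_1) \Big( \frac{(1 - u_1) m_1}{2} + \frac{(1 - u_2) m_2}{2} \Big)$$

$$L_{2,1}(u, m) = u_2 \Big( \frac{u_1 m_1}{2} + \frac{u_2 m_2}{2} \Big) + (1 - u_2) \Big( u_1 m_1 + \frac{(1 - u_1) m_1}{2} + u_2 m_2 + \frac{(1 - u_2) m_2}{2} \Big)$$

$$L_{1,0}(u, m) := u_1 \Big( \frac{u_1 m_1}{2} + \frac{u_2 m_2}{2} \Big) + (1 - u_1) \Big( u_1 m_1 + \frac{(1 - u_1) m_1}{2} + u_2 m_2 + \frac{(1 - u_2) m_2}{2} \Big)$$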

For B^-{ji,j2}, .t;.,.t, e{0,l,2}, 

Xi ,X2 ■,x'2 

Xi,X2,x[ 

^ ^ 'mx2Lx^x2;x'^,x'^{'^7 m) 
~Tnx ^ ^ fnx^Lx^^x;x\,x'2{'^Tm) 

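4.2 Computation of $R(u_1, u_2; s, m)$

We want to compute the value

$$V(u_1, u_2, x, m) := E_x \int_0^\infty e^{-\beta t}\, r(u_1, u_2, x(t), m(t))\, dt$$

s.t. $\dot{m}(t) = f(u_2, m(t))$, $m(0) = m_0$, $x(0) = x$.

For any $\Delta > 0$, splitting the integral at $\Delta$ gives

$$V(u_1, u_2, x, m) = E_x \int_0^\Delta e^{-\beta t}\, r(u_1, u_2, x(t), m(t))\, dt + E_x \int_\Delta^\infty e^{-\beta t}\, r(u_1, u_2, x(t), m(t))\, dt$$

$$= E_x \int_0^\Delta e^{-\beta t}\, r(u_1, u_2, x(t), m(t))\, dt + E_x e^{-\beta \Delta}\, V(u_1, u_2, x(\Delta), m(\Delta))$$

This implies that

$$0 = E_x \frac{1}{\Delta} \int_0^\Delta e^{-\beta t}\, r(u_1, u_2, x(t), m(t))\, dt + \frac{e^{-\beta \Delta} - 1}{\Delta}\, E_x V(u_1, u_2, x(\Delta), m(\Delta)) + \frac{E_x V(u_1, u_2, x(\Delta), m(\Delta)) - V(u_1, u_2, x, m)}{\Delta} \quad (4)$$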



Using Ito's formula and Lebesgue integration properties, we obtain that
$\frac{E_x V(u_1, u_2, x(\Delta), m(\Delta)) - V(u_1, u_2, x, m)}{\Delta}$
goes to $\sum_{x'} D_{m_{x'}} V(u_1, u_2, x', m)\, \dot{m}_{x'}$ + jumps, where $D_{m_{x'}} V$ is the derivative of
V in a weak sense, $\frac{e^{-\beta \Delta} - 1}{\Delta} \to -\beta$, and the term

$$E_x \frac{1}{\Delta} \int_0^\Delta e^{-\beta t}\, r(u_1, u_2, x(t), m(t))\, dt \to r(u_1, u_2, x, m_0)$$

when $\Delta$ goes to zero. The jump term is due to the changes in the process x.
It is explicitly determined by the transition rates, which contain
$u_1$, and the value V, as given in Section 3.7. Thus, we obtain

$$\beta V(u_1, u_2, x, m) = r(u_{1,x}, u_{2,x}, x, m) + \sum_{x'} \big(D_{m_{x'}} V(u_1, u_2, x', m)\big) f_{x'}(u_2, m) + \text{jumps} \quad (5)$$

where $u_{i,x} = u_i(H|x)$.

The optimality is then given by the Hamilton-Jacobi-Bellman equation obtained
by maximizing the right-hand side of equation (5) over the action
set:

$$\beta V(x, m) = \max_{u_{1,x}} \Big\{ r(u_{1,x}, u_{2,x}, x, m) + \sum_{x'} \big(D_{m_{x'}} V(u_1, u_2, x', m)\big) f_{x'}(u_2, m) + \text{jumps} \Big\}$$

and the optimality condition of the best response to $u_2$ is given by

$$\beta \Phi(u_2, x) = \max_{a \in \{H, D\}} \Big\{ r(a, u_{2,x}, x, m) + \sum_{x'} \big(D_{m_{x'}} \Phi(u_2, x')\big) f_{x'}(u_2, m) + \text{jumps} \Big\}$$

Note that in the global optimization case (under symmetry-per-class strategies)
we can drop the jump terms by computing the expected social welfare (which
does not depend on x but depends on m). Hence the equation reduces to one similar
to that in [5]. Now, if we consider the individual optimization problem, there is
a jump and drift term in the generator, as is usual in hybrid systems. These
equations are in general difficult to solve and the solutions are not necessarily
regular (e.g. viscosity solutions). Numerical approaches based on multi-grid
techniques for Hamilton-Jacobi-Bellman-Isaacs equations can be found in [8].



4.3 The case of two energy levels 

In order to derive closed form expressions for solutions of our ODE, we consider
two states, i.e., each animal has two states x = 1, 2 which represent its energy
levels. Thus, the ODE can be expressed as follows:

$$\dot{m}_2(t) = (1 - m_2(t)) L_{1,2}(u, m) - m_2(t) L_{2,1}(u, m) \quad (6)$$

which can be rewritten as

$$\dot{m}_2(t) = a_1 + a_2 m_2(t) + a_3 (m_2(t))^2 \quad (7)$$

with $a_1 = 1$, $a_2 = \frac{u_2}{2} - 2 < 0$, $a_3 = \frac{1 - u_2}{2} \ge 0$.

Let $m[u, m_0](t)$ be the solution of the ODE given u and an initial distribution
$m(0) = m_0$. We distinguish two cases:

Case 1 $u_2 = 1$ (fully aggressive when it is possible): the ODE becomes $\dot{m}_2(t) =
1 - \frac{3}{2} m_2(t)$ and the solution has the form

$$m_2[1, m_0](t) = \frac{2}{3} \big[ 1 - c_1 e^{-\frac{3}{2} t} \big] \quad (8)$$

with $c_1 = 1 - \frac{3}{2} m_0$ and $m_1[u, m_0](t) = 1 - m_2[u, m_0](t)$.
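Case 2 $u_2 \neq 1$ (less aggressive in state 2):

$$m_2[u, m_0](t) = \gamma_-(u) + \frac{\gamma_+(u) - \gamma_-(u)}{1 - c_2\, e^{t \sqrt{2 + u_2^2/4}}} \quad (9)$$

where $c_2 = 1 + \frac{\gamma_-(u) - \gamma_+(u)}{m_0 - \gamma_-(u)}$,

$$\gamma_-(u) = \frac{2 - u_2/2 - (2 + u_2^2/4)^{1/2}}{1 - u_2}, \qquad \gamma_+(u) = \frac{2 - u_2/2 + (2 + u_2^2/4)^{1/2}}{1 - u_2} > 1$$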



Note that in both cases there is a unique strategy-dependent global attractor:

$$\lim_{t \to \infty} m_2[u, m_0](t) = \begin{cases} \gamma_-(u) & \text{if } u_2 \neq 1 \\ 2/3 & \text{if } u_2 = 1 \end{cases}$$
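The expected instant payoff of a player using the stationary strategy v when
the population profile is $m[u, m_0](t)$ is given by

$$r(v, u, 2, m[u, m_0](t)) = v\,[\,v - c\, m_2[u, m_0](t)\, u_2\,] + (1 - v)\, r(v, u, 1, m[u, m_0](t))$$

$$r(v, u, 1, m[u, m_0](t)) = \tfrac{1}{2} \big(1 - m_2[u, m_0](t)\, u_2\big)\, v$$

where, inside the brackets and in the second formula, v is the value of the food, the
prefactors v and 1 - v are the probabilities with which the player plays Hawk and
Dove in state 2, and $m_2[u, m_0](t)$ is given by (8) (resp. (9)) for $u_2 = 1$ (resp. $u_2 \neq 1$). Now,
we can compute explicitly the best response against u for a given initial $m_0$.
Let

$$\beta_2(u, 2, m_0, t) = r(H, u, 2, m[u, m_0](t)) - r(D, u, 2, m[u, m_0](t)).$$

The best response, $\mathrm{BR}(x, u, m[u, m_0](t))$, against u at t is

$$\mathrm{BR}(x, u, m[u, m_0](t)) = \begin{cases} \text{play Hawk} & \text{if } \beta_2(u, x, m_0, t) > 0 \\ \text{play Dove} & \text{if } \beta_2(u, x, m_0, t) < 0 \end{cases}$$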

This implies that it is better to play Hawk for ^ > where $\gamma = \max(2/3, m_0)$.
Since the solution of the ODE is strictly monotone in time for each stationary
strategy, there is at most one time for which $\beta_2$ is zero. It is easy to see that if
^ > I then the strategy which consists in playing Hawk in state 2 and Dove in state 1 is
an equilibrium.
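The comparison between Hawk and Dove in state 2 can be made concrete with a short Python sketch. It uses the two payoff expressions given above for the two-level model (Hawk earns $v - c\, m_2 u_2$, Dove earns $\frac{1}{2}(1 - m_2 u_2) v$); the function names and the tie-breaking label are assumptions made for illustration.

```python
def hawk_advantage(m2, u2, v, c):
    """Payoff difference r(H) - r(D) in state 2 for the two-level model.

    m2 -- current fraction of the population in state 2
    u2 -- probability that a state-2 opponent plays Hawk
    v  -- value of the food, c -- cost parameter of a Hawk-Hawk fight
    """
    r_hawk = v - c * m2 * u2
    r_dove = 0.5 * (1.0 - m2 * u2) * v
    return r_hawk - r_dove


def best_response(m2, u2, v, c):
    """Return "H", "D", or "either" when the player is indifferent."""
    adv = hawk_advantage(m2, u2, v, c)
    if adv > 0:
        return "H"
    if adv < 0:
        return "D"
    return "either"
```

Since $m_2[u, m_0](t)$ is monotone in time, evaluating `best_response` along the trajectory shows at most one switch between Hawk and Dove, which is why the best response depends on the initial condition.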



Figure 1: Global attractor for U2 = 1 



Figure 2: Global attractor for U2 = 0.2 



5 Concluding remarks 

The goal of this paper has been to develop mean field asymptotics of interactions
with a large number of players using stochastic games. Due to the curse of the size
of the population, the applicability of atomic stochastic games has been severely
limited. As an alternative, we proposed a method for mean field Markov games
where players make decisions based only on their own state and the global
system state. We have shown convergence results under mild assumptions,
where asymptotics were taken in the number of players. The population state
profile satisfies a system of non-linear ordinary differential equations. We have
considered a very simple class of strategies that are functions only of the player's own
state and the population profile. We applied the model to a Hawk-Dove interaction with
several energy levels and formulated the ODEs. We showed that the best response
depends on the initial conditions.

Appendix 

Sketch of proof of Proposition 2.2.1

Let $\tau^N$ be the first time after $t = 0$ that $X_j^N(t)$ hits some given state. We
show that

^"'' = ^^ E e-^V^(xf(.),A./^(.)) (10) 
s=o step i/N 



Define, for $t \in \mathbb{N}/N$:

s=0 step 1/Af 

We have, for $0 \le s \le t$:

Q := ]E(zf-Zf|^f) 
t 

u' = 

step i/N 
which can be written as 

Y^e-^^'E (E (G^(u') - {Xf{u'), M^iu')) ) 

u'=0 

step i/jv 

= 

thus $Z^N$ is an $\mathcal{F}^N_t$-martingale. Now $\tau^N$ is a stopping time with respect to the
filtration $\mathcal{F}^N_t$; thus, by Doob's optional stopping theorem, $E Z^N_{\tau^N \wedge t} = E Z^N_0 = 0$.
Further, $|Z^N_{\tau^N \wedge t}| \le K |\tau^N|$ for some constant $K$. Since $\tau^N$ is almost surely finite
and has a finite expectation, we can apply dominated convergence (with $t \to \infty$)
and obtain $E Z^N_{\tau^N} = 0$.

Sketch of Proof of Theorem 3.2.1

To prove the weak convergence of $Z^N$, we check the following steps. Without
loss of generality, we take the set of states to be $S = \{0, 1, 2, \ldots, \#S\}$. $X_j^N$ has a
jump $r$ with probability

and is the continuous process with drift f^. 

• We introduce a scaled version of $X_j^N$ with step size $1/N$. Then $Z^N = (X^N, M^N)$
is approximated, in some sense, by a discrete time process $\tilde{Z}^N = (\tilde{X}^N, \tilde{m}^N)$,
where $\tilde{m}^N(k) = m(\lfloor Nt \rfloor)$, $m$ is the solution of the ODE, and $\tilde{X}^N$ is the discrete
time jump process with transition matrix

We show that $d(X_j^N, \tilde{X}_j^N) \to 0$ on any compact time interval.

$\tilde{Z}^N = (\tilde{X}^N, \tilde{m}^N) \Rightarrow (X, m)$

$M^N(\lfloor Nt \rfloor) \to m(t)$. We derive the weak convergence of $Z^N$ to $(X, m)$,
where $m$ is deterministic and $X$ is random.



Approximation by a discrete time process 

The following lemma follows from Lemmas 1 and 3 in Benaim and Weibull
(2003, 2008), in which we incorporate behavioral strategies.

Lemma 5.0.1. For every $T > 0$ there exists a constant $c$ such that for every
$\epsilon > 0$ and $N$ large enough one has

$$P\Big( \sup_{0 \le \tau \le T} \| M^N(\tau) - m(\tau) \| > \epsilon \;\Big|\; M^N(0) = m_0, u \Big) \le 2(\#S)\, e^{-\epsilon^2 c N}$$

for all $m_0 \in \Delta$ and every stationary strategy $u$.

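Since $c$ does not depend on $N$ or $m_0$, and $(e^{-\epsilon^2 c N})_N$ is summable, we can use the
dominated convergence theorem: for all $\epsilon > 0$,

$$\sum_N P\Big( \sup_{0 \le \tau \le T} \| M^N(\tau) - m(\tau) \|_\infty > \epsilon \;\Big|\; M^N(0) = m_0, u \Big) < \infty.$$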

By the Borel-Cantelli lemma, for every fixed $T < \infty$, the random variable
$\sup_{0 \le \tau \le T} \| M^N(\tau) - m(\tau) \|_\infty$ converges almost completely towards 0. This
implies that it converges almost surely to 0.
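
The summability used in the Borel-Cantelli step can be checked numerically. The sketch below assumes the bound has the form $2(\#S) e^{-\epsilon^2 c N}$ and compares a long partial sum with the closed-form geometric series. The values passed for `n_states`, `eps` and `c` are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math


def deviation_bound(n_states, eps, c, n_players):
    """Assumed upper bound 2 * (#S) * exp(-eps^2 * c * N) on the deviation probability."""
    return 2.0 * n_states * math.exp(-(eps ** 2) * c * n_players)


def bound_series_sum(n_states, eps, c):
    """Closed form of sum_{N >= 1} deviation_bound(n_states, eps, c, N).

    The terms form a geometric series with ratio q = exp(-eps^2 * c) < 1.
    """
    q = math.exp(-(eps ** 2) * c)
    return 2.0 * n_states * q / (1.0 - q)


# Partial sum over N = 1, ..., 1999 for illustrative parameter values.
partial_sum = sum(deviation_bound(3, 0.5, 1.0, n) for n in range(1, 2000))
```

Because the series is finite for every fixed `eps > 0`, the deviation event can occur for only finitely many `N` almost surely, which is the conclusion drawn in the text.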

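We introduce a scaled version of $X_j^N$ with step size $1/N$. Then $Z^N = (X^N, M^N)$ is
approximated, in some sense, by a discrete time process $\tilde{Z}^N = (\tilde{X}^N, \tilde{m}^N)$, where
$\tilde{m}^N(k) = m(\lfloor Nt \rfloor)$, $m$ is the solution of the ODE, and $\tilde{X}_j^N$ is the discrete time jump
process with transition matrix

$$q^N_{i,i+r}(\tilde{m}^N(k)) = \frac{1}{N} L_{i,i+r}\Big( m\big(\tfrac{k}{N}\big), u \Big).$$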

Using Lemma 5.0.1 and the uniform Lipschitz continuity of $q^N$, we obtain
that

$$\sup_{i,j} \sup_{0 \le \tau \le t} \| q^N_{i,j}(M^N(\tau)) - q_{i,j}(m(\tau)) \| \le K \Big( \epsilon_N + \sup_{0 \le \tau \le t} \| M^N(\tau) - m(\tau) \| \Big).$$

Hence, we can write \\M^{t) — < K{eN + 772) over set of event fig = 

{\\M^{t) - TO(r)|| < e} and P(a) > 1 - 2(ttS')e-'^''^^ ^ 1. Thus, 

P(X^j- |[o,t] = ^jj[o,i]|fc transitions) > E(e-«'"(^'^*)) 

E(e^™(^'^*)) = (1 - 1 + le)^* 

P(X^j.|[o.t] = ^f\[n t]\k transitions) > 

and this holds for arbitrarily small $\epsilon$. We define $d(X, Y) = \sum_{k \ge 0} d(X_k, Y_k)$,
where $d(X_k, Y_k) = 1_{\{X_k \neq Y_k\}}$. Then $d(X_j^N|_{[0,t]}, \tilde{X}_j^N|_{[0,t]}) \to 0$ when $N$ goes to
infinity.

Convergence of the discrete time process. To prove the weak convergence
of $(X^N, M^N)$, we check the following steps:



• The discrete time empirical measures are tight (this follows from Sznitman
for finite state spaces) and converge to the solution of a martingale problem.
The limit $\tilde{m}$ is a deterministic measure and a solution of the ODE, which has
the unique solution $m$ (given $m_0$, $u$). Thus, $\tilde{m} = m$.

• Conditionally on $M^N$, $X_j^N$ converges to the solution of a martingale problem.
The jump and drift process $X$ with time-dependent transitions is given by the
limit of the marginal of $P^N(\cdot \mid M^N, m_0, x_0, u)$. We derive the weak convergence of
$(X_j^N, M^N)$ to $(X, \tilde{m})$, where $\tilde{m}$ is deterministic and $X$ is random. For this
we use Theorem 17.25 and its discrete time approximation in Theorem
17.28, pages 344-347 in Kallenberg.

Sketch of Proof of Theorem 3.3.1

Since Skorohod's topology is induced by a metric, it is sufficient to show that
whenever $(x^N, m^N) \to (x, m)$ in Skorohod's topology, we have:

$$\lim_{N \to \infty} \int_0^\infty e^{-\beta t} r^N(v, x^N(t), m^N(t))\,dt = \int_0^\infty e^{-\beta t} r(v, x(t), m(t))\,dt.$$

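By [7], page 117, there is some sequence of increasing bijections $\lambda_n: [0, \infty) \to
[0, \infty)$ s.t.

$$\frac{\lambda_n(t) - \lambda_n(s)}{t - s} \to 1 \quad \text{uniformly in } t \text{ and } s$$

and $\| y_n(t) - y(\lambda_n(t)) \| \to 0$ uniformly in $t$
over compact subsets of $[0, \infty)$. Fix $\epsilon > 0$ arbitrary and consider

$$h^N := \Big| \int_0^\infty e^{-\beta t} r^N(x^N(t), v, m^N(t))\,dt - \int_0^\infty e^{-\beta t} r(x(t), v, m(t))\,dt \Big| \le \int_0^\infty e^{-\beta t} \big| r^N(x^N(t), v, m^N(t)) - r(x(t), v, m(t)) \big|\,dt.$$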









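First let $K = \sup_{x \in S,\, v,\, m \in \Delta} |r(x, v, m)| < \infty$ by hypothesis, and pick some
time $T$ large enough such that $2K \int_T^\infty e^{-\beta t}\,dt < \epsilon/3$. Thus

$$h^N \le \epsilon/3 + \int_0^T e^{-\beta t} \big| r(x^N(t), v, m^N(t)) - r(x(t), v, m(t)) \big|\,dt \qquad (11)$$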
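
To make the choice of $T$ concrete, here is a minimal sketch. It assumes that the tail of the integral is controlled by $2K \int_T^\infty e^{-\beta t}\,dt = 2K e^{-\beta T}/\beta$, which holds when $|r| \le K$; the helper names are hypothetical.

```python
import math


def discounted_tail(K, beta, T):
    """Value of 2K * integral_T^inf exp(-beta * t) dt = 2K * exp(-beta * T) / beta."""
    return 2.0 * K * math.exp(-beta * T) / beta


def truncation_horizon(K, beta, eps):
    """Smallest T >= 0 such that discounted_tail(K, beta, T) <= eps / 3.

    Solving 2K * exp(-beta * T) / beta = eps / 3 gives
    T = log(6K / (beta * eps)) / beta; T = 0 suffices when that is negative.
    """
    return max(0.0, math.log(6.0 * K / (beta * eps)) / beta)
```

Any horizon strictly larger than `truncation_horizon(K, beta, eps)` makes the tail strictly smaller than `eps / 3`, so only the integral over `[0, T]` remains to be controlled.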

Second, we use the distance on $E$ defined by

$$d((x, m), (x', m')) = \| m - m' \| + 1_{\{x \neq x'\}}. \qquad (12)$$
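
The distance in (12) is simple to compute for a finite state space. The sketch below assumes the sup norm for $\| m - m' \|$ (the text does not fix the norm at this point) and represents population profiles as plain lists of equal length; `profile_distance` is a hypothetical name.

```python
def profile_distance(x, m, x_prime, m_prime):
    """d((x, m), (x', m')) = ||m - m'||_inf + 1 if x != x' else 0.

    x, x_prime: individual states (any comparable labels).
    m, m_prime: population profiles as equal-length sequences of floats.
    """
    if len(m) != len(m_prime):
        raise ValueError("profiles must have the same length")
    sup_norm = max(abs(a - b) for a, b in zip(m, m_prime))
    indicator = 1.0 if x != x_prime else 0.0
    return sup_norm + indicator
```

With this metric, two pairs are close only if the individual states coincide and the profiles are close, which is why a jump of the individual state contributes a full unit to the distance.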



Let $K' = \sup_{x \in S,\, v,\, m \neq m' \in \Delta} \dfrac{|r(x, v, m) - r(x, v, m')|}{\| m - m' \|} < \infty$

by hypothesis. It is easy to see that for all $x, x' \in S$ and $m, m' \in \Delta$:

$$|r(x, v, m) - r(x', v, m')| \le K' d((x, m), (x', m')) \qquad (13)$$

Thus, by Equation (11):

$$h^N \le \epsilon/3 + K' \int_0^T e^{-\beta t} d\big((x^N(t), m^N(t)), (x(t), m(t))\big)\,dt \qquad (14)$$

By [7], page 117, there is some sequence of increasing bijections $\lambda^N: [0, \infty) \to
[0, \infty)$ s.t.

$$\frac{\lambda^N(t) - \lambda^N(s)}{t - s} \to 1 \quad \text{uniformly in } t \text{ and } s$$

and $d\big((x^N(t), m^N(t)), (x(\lambda^N(t)), m(\lambda^N(t)))\big) \to 0$

uniformly in $t$ over compact subsets of $[0, \infty)$. Thus there is some $N_0 \in \mathbb{N}$ such
that for $N \ge N_0$ and $t \in [0, T]$:

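$$d\big((x^N(t), m^N(t)), (x(\lambda^N(t)), m(\lambda^N(t)))\big) < \delta \qquad (15)$$
where $\delta > 0$ is chosen such that $K' \delta \int_0^T e^{-\beta t}\,dt \le \epsilon/3$.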
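Thus, by the triangle inequality for $d$:

$$h^N \le \frac{\epsilon}{3} + K' \int_0^T e^{-\beta t} d\big((x^N(t), m^N(t)), (x(\lambda^N(t)), m(\lambda^N(t)))\big)\,dt + K' \int_0^T e^{-\beta t} d\big((x(\lambda^N(t)), m(\lambda^N(t))), (x(t), m(t))\big)\,dt$$

$$\le \frac{2\epsilon}{3} + K' \int_0^T e^{-\beta t} d\big((x(\lambda^N(t)), m(\lambda^N(t))), (x(t), m(t))\big)\,dt \qquad (16)$$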

Third, let $D$ be the set of discontinuity points of $(x, m)$. Since $(x, m)$ is
càdlàg, $D$ is countable, thus it is negligible for the Lebesgue measure and

$$\int_0^T e^{-\beta t} d\big((x(\lambda^N(t)), m(\lambda^N(t))), (x(t), m(t))\big)\,dt = \int_0^T e^{-\beta t} d\big((x(\lambda^N(t)), m(\lambda^N(t))), (x(t), m(t))\big)\, 1_{\{t \notin D\}}\,dt.$$

Now $\lim_{N \to \infty} \lambda^N(t) = t$ and thus for $t \notin D$

$$\lim_{N \to \infty} d\big((x(\lambda^N(t)), m(\lambda^N(t))), (x(t), m(t))\big) = 0$$



and thus by dominated convergence

$$\lim_{N \to \infty} \int_0^T e^{-\beta t} d\big((x(\lambda^N(t)), m(\lambda^N(t))), (x(t), m(t))\big)\,dt = 0 \qquad (17)$$

and for $N$ large enough the second term in the right-hand side of Equation (16)
can be made smaller than $\epsilon/3$. Finally, for $N$ large enough, $h^N < \epsilon$. This
completes the proof.

Sketch of Proof of Theorem IHX^ 

Define the discounted stochastic evolutionary game with a random number of
interacting players in each local interaction, in which each player in state $x$ with the
mixed action $u(\cdot \mid x)$ receives $r(u, x, m(t))$, where $m(t)$ is the population profile
at $t$, which evolves under the dynamical system (3), and the transition between states
follows the transition kernel $L$. Then, a strategy of a player is the same as in
the microscopic case and the discounted payoff

POO 

Jo 

is the limit of $R^N(u_1, u_2, s_0, m_0)$ when $N$ goes to infinity, where $m[u_2]$ is the
solution of the ODE $\dot{m} = f(u_2, m)$, $m(0) = m_0$. It follows that the asymptotic
regime of the microscopic game and the Markov decision evolutionary game
(macroscopic game) are equivalent.

Sketch of Proof of Theorem 3.6.1

We show that for every discount factor $\beta > 0$ the optimal control problem
$(OPT_N)$ (resp. the fixed-point problem $(FIX_N)$) has at least one $0$-optimal
strategy. This follows from the existence of equilibria in stationary strategies for
finite stochastic games with discounted payoff. The set of pure strategies is a
compact space in the product topology (Tychonoff theorem). Thus, the set
of behavioral strategies $\Sigma_j$ is a compact space, and it is also convex as the set of
probabilities on the pure strategies. For every player $j$ and every strategy profile
$\sigma$, the marginals of the payoff and constraint functions are continuous for any
$\beta > 0$: $\sigma_j \mapsto R_j^N(\sigma_j, \sigma_{-j}, s, m_0)$. Moreover, the set of stationary strategies is convex,
compact and upper and lower hemi-continuous (as a correspondence). Define

$$\gamma_j(s, m_0, \sigma) = \arg\max_{\sigma_j' \in \Sigma_j} R_j^N(\sigma_j', \sigma_{-j}, s, m_0).$$

Then, $\gamma_j(s, m_0, \sigma) \subseteq \Sigma_j$ is a non-empty, convex and compact set and the product
correspondence

$$\gamma : \sigma \mapsto (\gamma_1(s, m_0, \sigma), \ldots, \gamma_N(s, m_0, \sigma))$$

is upper hemi-continuous (its graph is closed). We now use the Glicksberg
generalization of the Kakutani fixed point theorem: there is a stationary strategy
profile $\sigma^*$ such that

$$\sigma^* \in \gamma(s, m_0, \sigma^*).$$

Moreover, if the game has symmetric payoffs and strategies for each type, there 
is a symmetric per type stationary equilibrium. This completes the proof. 



Sketch of Proof of Theorem ISX^ 

Let $(U^N)_N$ be a sequence of solutions of $(FIX_N)$, i.e. equilibria in the system
with $N$ players. Choose a subsequence $N_k$ such that $U^{N_k}$ converges to some
point $U$ when $k$ goes to infinity. We can write

$$R^{N_k}(U^{N_k}, U^{N_k}) - R(U, U) = R^{N_k}(U^{N_k}, U^{N_k}) - R^{N_k}(U, U) + R^{N_k}(U, U) - R(U, U).$$

Since $R$ is continuous and $R^N$ converges uniformly to $R(\cdot, \cdot)$, $R^{N_k}$ converges
uniformly to $R$, the second term $R^{N_k}(U, U) - R(U, U) \to 0$ when $k \to \infty$, and
the first term $R^{N_k}(U^{N_k}, U^{N_k}) - R^{N_k}(U, U)$ can be rewritten as

$$R^{N_k}(U^{N_k}, U^{N_k}) - R^{N_k}(U, U) = R^{N_k}(U^{N_k}, U^{N_k}) - R(U^{N_k}, U^{N_k}) + R(U^{N_k}, U^{N_k}) - R(U, U) + R(U, U) - R^{N_k}(U, U).$$

Each term goes to zero by continuity of $R$, convergence of $U^{N_k}$ to $U$,
and uniform convergence of $R^N$ to $R$. Let $U^N$ be an $\epsilon_N$-equilibrium. Then
$R^N(U^N, U^N) \ge R^N(v, U^N) - \epsilon_N$, $\forall v$. Then any limit $U$ of a subsequence of
$U^N$ satisfies $R(U, U) \ge R(v, U) - \epsilon$, $\forall v$. Similarly, if

$$R^N(U^N, U^N) \ge R^N(v, v) - \epsilon_N, \quad \forall v$$

then any omega-limit $U$ of the sequence $U^N$ satisfies $R(U, U) \ge R(v, v) - \epsilon$, $\forall v$,
i.e. $U$ is an $\epsilon$-optimal strategy. In particular, if $(U^N)_N$ is a sequence of
$\epsilon_N$-equilibria (resp. $\epsilon_N$-optimal strategies) with $\epsilon_N \to 0$ when $N$ goes to infinity,
then any accumulation point $U$ of $(U^N)_N$ is a $0$-equilibrium (resp. $0$-optimal
strategy).